TAS: KEP update for API #3237

Merged · 1 commit · Oct 16, 2024

Conversation

@mimowo (Contributor) commented Oct 15, 2024

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

To reflect API design decisions made during implementation:

  • better comments
  • change the TopologyAssignment in the Workload API:
    • use "domain" rather than "group" consistently
    • break nodeLabels for domains into an ordered list of keys (levels) common to all domains, and an ordered list of values per domain; this saves a lot of space for large assignments by avoiding duplication of the keys (see the sketch below)
    • included example here
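
As a rough illustration (the struct layout follows the fields discussed in this PR, but the type names are local stand-ins and the label keys/values are made up), the keys are stored once per assignment and each domain repeats only its values:

```go
// Simplified stand-ins for the API types discussed in this PR.
type TopologyDomainAssignment struct {
	// values are ordered consistently with the levels of the assignment.
	Values []string `json:"values"`
}

type TopologyAssignment struct {
	// levels lists the node label keys once, shared by every domain.
	Levels []string `json:"levels"`
	// domains repeat only the per-domain values, avoiding key duplication.
	Domains []TopologyDomainAssignment `json:"domains"`
}

// Hypothetical assignment spanning two racks within one block.
var example = TopologyAssignment{
	Levels: []string{"example.com/topology-block", "example.com/topology-rack"},
	Domains: []TopologyDomainAssignment{
		{Values: []string{"block-1", "rack-1"}},
		{Values: []string{"block-1", "rack-2"}},
	},
}
```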

Which issue(s) this PR fixes:

Part of #2724

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot added the release-note-none and kind/cleanup labels on Oct 15, 2024
@k8s-ci-robot added the cncf-cla: yes and approved labels on Oct 15, 2024

@mimowo (Contributor, author) commented Oct 15, 2024

/cc @tenzen-y @PBundyra

@k8s-ci-robot added the size/XXL label on Oct 15, 2024
@k8s-ci-robot requested a review from PBundyra on October 15, 2024 14:16
netlify bot commented Oct 15, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 864e36a
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/670fdb9ccc43e400084b493a

@k8s-ci-robot added the size/L label and removed the size/XXL label on Oct 15, 2024

@mimowo (Contributor, author) commented Oct 15, 2024

/assign @PBundyra @tenzen-y

@mimowo (Contributor, author) commented Oct 15, 2024

/cc @mwielgus @alculquicondor

Groups []TopologyAssignmentGroup `json:"groups"`
Domains []TopologyDomainAssignment `json:"domains"`

// levels is an ordered list of keys denoting the levels of the assigned

Contributor

What is the purpose of this field?

Contributor Author

TopologyUngater needs the information about the keys to construct the full node selector when ungating pods.

This information is also present in the Topology object, but I think it is useful to also put it here.

Otherwise we would need to complicate the code of the Ungater to look up the Topology object via the ResourceFlavor, and deal with possible deletions / changes of the API objects in the meantime.

The list has at most 5 elements (levels), so the Workload object size should not be an issue.
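
A minimal sketch (hypothetical helper, not the actual Kueue code) of how the ungater could combine the shared keys with a single domain's values into a node selector:

```go
// buildNodeSelector zips the assignment-level keys (.levels) with one
// domain's values (.domains[i].values); both lists are ordered consistently,
// so index i of values corresponds to index i of levels.
func buildNodeSelector(levels, values []string) map[string]string {
	selector := make(map[string]string, len(levels))
	for i, key := range levels {
		selector[key] = values[i]
	}
	return selector
}
```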

Contributor

Ok, put this field above domains.

Contributor Author

Done

// +kubebuilder:validation:MinItems=1
NodeLabels map[string]string `json:"nodeLabels"`
type TopologyDomainAssignment struct {
// values is an ordered list of node selector values describing a topology

Contributor

It would be clearer if you duplicated this description in TopologyAssignment.

Contributor Author

Sure, I've updated the TopologyAssignment description to include the information on how the node selector is specified. PTAL.

//
// +required
// +listType=atomic
// +kubebuilder:validation:MinItems=1
// +kubebuilder:validation:MaxItems=5

Contributor

Let's be a bit more generous here and bump it to 8, so that it covers all possible use cases and at the same time has a sane max limit.

Contributor Author

sure, done


@PBundyra (Contributor)

/lgtm
/hold so @tenzen-y can take a look

@k8s-ci-robot added the do-not-merge/hold and lgtm labels on Oct 16, 2024

@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: cb7590a3326b99750053131557778fc63df34b2f

@k8s-ci-robot removed the lgtm label on Oct 16, 2024

@mwielgus (Contributor)

/lgtm

@k8s-ci-robot added the lgtm label on Oct 16, 2024

@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: b6669556f015d2be83940c4a9d0a8277537c2614

// +kubebuilder:validation:MinItems=1
NodeLabels map[string]string `json:"nodeLabels"`
// +kubebuilder:validation:MaxItems=8
Values []string `json:"values"`

Member

I think values is a slightly generic name, and it's hard to recognize the objective of this field, since the TopologyDomainAssignment does not have the context of the nodeSelector.

Will we receive anything other than the nodeSelector keys here?

@mimowo (Contributor, author), Oct 16, 2024

> I think values is a slightly generic name, and it's hard to recognize the objective of this field, since the TopologyDomainAssignment does not have the context of the nodeSelector.

I agree, this is why I mention the node selector in the comment. We could name the field nodeSelectorValues, but it seems verbose.

> Will we receive anything other than the nodeSelector keys here?

The node selectors are constructed by combining only the keys (.levels field) and values (.domains.values field).

Member

> Will we receive anything other than the nodeSelector keys here?
>
> The node selectors are constructed by combining only the keys (.levels field) and values (.domains.values field).

Sorry, I meant the nodeSelector values.

> I agree, this is why I mention the node selector in the comment. We could name the field nodeSelectorValues, but it seems verbose.

In that case, how about levelValues, since these values are for the levels?

Contributor Author

> In that case, how about levelValues, since these values are for the levels?

As a name it sgtm, but maybe the only potential downside is gRPC size if we have many domains assigned? I guess for big workloads it could be 10k domains or more (if the lowest level is the individual node). WDYT?

@tenzen-y (Member), Oct 16, 2024

I guess that levelValues does not significantly impact the gRPC message size, since the longer field name increases it by just ~0.05 MiB in the 10k-domains situation.

Additionally, I guess that TAS will be used for Top-of-Rack (ToR), Leaf, or Spine switches rather than Nodes, since the topology often needs to be considered based on the network topology, right?

Let me know if you have actual or imaginable situations where we need to construct the topology based on Nodes.

Contributor Author

> Additionally, I guess that TAS will be used for Top-of-Rack (ToR), Leaf, or Spine switches rather than Nodes, since the topology often needs to be considered based on the network topology, right?

Later maybe yes; for now we just want to ensure the pods are landing on closely connected nodes.

I think there could be use cases to assign at the level of nodes to ensure there is no issue with fragmentation of quota. For example, if you assign at the level of a rack, it could happen that a Pod can fit in the rack, but cannot fit on any individual node. I guess it will need to depend on the user preference.

Contributor Author

> I guess that levelValues does not significantly impact the gRPC message size, since the longer field name increases it by just ~0.05 MiB in the 10k-domains situation.

Yes, it is not much, but the overall max API object size in etcd is 1.5 MiB, which is not much either, so the gain is around 3% of the max size.
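
A back-of-envelope check of these figures (assuming the only extra cost of the longer name is the renamed JSON key, serialized once per domain):

```go
// Rough arithmetic behind the numbers quoted above.
const (
	extraBytesPerDomain = len("levelValues") - len("values") // 5 extra bytes per domain
	domains             = 10_000
	etcdMaxObjectBytes  = 1_500_000 // ~1.5 MiB etcd object size limit
)

// extraBytes ≈ 50 kB ≈ 0.05 MiB; shareOfMax ≈ 3% of the etcd limit.
var (
	extraBytes = extraBytesPerDomain * domains
	shareOfMax = 100 * float64(extraBytes) / etcdMaxObjectBytes
)
```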

Member

> For example, if you assign at the level of a rack, it could happen that a Pod can fit in the rack, but cannot fit on any individual node. I guess it will need to depend on the user preference.

I feel that this usage goes beyond the initial TAS motivations. But I know that we received similar requests here: #3211

> Yes, it is not much, but the overall max API object size in etcd is 1.5 MiB, which is not much either, so the gain is around 3% of the max size.

Yes, that's true. The key problem here is which disadvantage we should accept. I think both options have the disadvantages below:

  • values: it lacks the context and background that this field holds the values for the nodeSelector keys defined in the levels; only after reading the API documentation can we understand the purpose of this field.
  • levelValues: it increases the Workload object size, which may cause etcd performance issues.

Given that the Workload object is for Kueue developers and not exposed to the batch users, I am leaning toward accepting values. In that case, could you mention this values vs levelValues discussion in the Alternatives section? I guess that we can revisit this alternative based on user feedback, for example if batch admins often struggle because of this field name.

// +required
// +listType=atomic
// +kubebuilder:validation:MinItems=1
Groups []TopologyAssignmentGroup `json:"groups"`
// +kubebuilder:validation:MaxItems=8
Levels []string `json:"levels"`

@tenzen-y (Member), Oct 16, 2024

Which objective does this field serve: "storing desired state" or "a temporary cache for the computation by the TopologyUngater"?
I guess that this is the "temporary cache for the computation by the TopologyUngater".

In that case, could we put this information in an internal cache instead of the CRD?

Contributor Author

It is the desired state to be reached by the Pods. We cannot use internal state, because then it would be lost on Kueue restart.

Member

Does that mean that the TopologyUngater cannot obtain the levels from the Topology CR when Kueue restarts?

Contributor Author

Oh, I see what you mean. A similar question was asked in this comment.

It makes the implementation of the TopologyUngater simpler (otherwise it would require lookups via the ResourceFlavor API to the Topology API, and dealing with changes to those objects), at a very small cost in terms of object size, as the number of levels is quite limited.

Also, I think it is conceptually consistent with what we do here, as we store the mapping between resources and flavors, rather than reading the resources from the ClusterQueue API object.

Member

> otherwise it would require lookups via the ResourceFlavor API to the Topology API, and dealing with changes to those objects

I think that this could be resolved by a dedicated internal cache to store the relationships between ResourceFlavor and Topology.

> at a very small cost in terms of object size, as the number of levels is quite limited

Yes, I agree with you. My doubt was that the field is just a cache, since we can obtain it from other objects.

> Also, I think it is conceptually consistent with what we do here, as we store the mapping between resources and flavors, rather than reading the resources from the ClusterQueue API object.

Uhm, I see. Indeed, the levels (new field) and flavors (existing field) are stored in the Workload status, and this seems to avoid the chain of API calls (ResourceFlavor -> Topology).

When we graduate the API version to v1beta2, we may want to restructure the Workload status field so that we can store the actually assigned flavor and topology information in the same field.

Anyway, I'm ok with adding the levels here. Thanks.

@mimowo (Contributor, author), Oct 16, 2024

> I think that this could be resolved by a dedicated internal cache to store the relationships between ResourceFlavor and Topology.

Sure, it looks possible, I get the point, but I think it would complicate the implementation to maintain the cache, rather than reading from the object.

> When we graduate the API version to v1beta2, we may want to restructure the Workload status field so that we can store the actually assigned flavor and topology information in the same field.

Sure, I'm ok to add it to the KEP to re-evaluate this decision.

> Anyway, I'm ok with adding the levels here. Thanks.

Thanks!

@mimowo (Contributor, author), Oct 16, 2024

> When we graduate the API version to v1beta2, we may want to restructure the Workload status field so that we can store the actually assigned flavor and topology information in the same field.

> Sure, I'm ok to add it to the KEP to re-evaluate this decision.

Let me know if you want that to be added, though I think it is unlikely to be dropped. For example, if we read the keys from a cache, they could change in the meantime, and then the domain assignment would be corrupted. If we know the keys at the moment of assignment, we can compare them with the ones in the cache (maintained based on the Topology API) and evict the workload.

Member

> Sure, it looks possible, I get the point, but I think it would complicate the implementation to maintain the cache, rather than reading from the object.

Yeah, that's true. I think this is a trade-off between implementation costs and API maintenance costs, since a dedicated cache could avoid exposing the API (and the risk of breaking its structure later), but it has implementation costs.

Anyway, I do not insist on a dedicated cache now.

> For example, if we read the keys from a cache, they could change in the meantime, and then the domain assignment would be corrupted.

This could be prevented by event-based cache updates, the same as in today's cache package (/pkg/cache).
I'm wondering if we could mention the dedicated cache mechanism, instead of the status field, in the Alternatives section.
But this is non-blocking for merging this PR.

Contributor Author

I can update the Alternatives section in a follow-up.

Member

Yeah, sure.

@k8s-ci-robot removed the lgtm label on Oct 16, 2024

@tenzen-y (Member) left a comment

Thank you for the update!
There are no blockers, and some comments can be addressed in follow-ups.
/lgtm
/approve

@k8s-ci-robot added the lgtm label on Oct 16, 2024

@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: db598e6a06d554912fabf11c8e6fba23db305b46

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y (Member)

/hold cancel

@k8s-ci-robot removed the do-not-merge/hold label on Oct 16, 2024
@k8s-ci-robot merged commit 13e6b2b into kubernetes-sigs:main on Oct 16, 2024
7 checks passed
@k8s-ci-robot added this to the v0.9 milestone on Oct 16, 2024
PBundyra pushed a commit to PBundyra/kueue that referenced this pull request Nov 5, 2024
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024